Final Project Report
Intermediate Analytics
ALY 6015
Maheswar Raju Narasaiah
Prof: Eric Gero
Date: 15 January 2023
1. INTRODUCTION
In this assignment, we interpret and evaluate models built on the Ames Housing data. We construct and analyze two regression models, interpret their results, and use diagnostic methods to identify and resolve any problems with the models.
Objectives of the Project
1: Develop and analyze regression models using established functions and diagnostic methods.
2: Address problems related to overfitting, linearity, multicollinearity, and outliers.
3: Utilize automated techniques to determine the most appropriate model from a pool of multiple predictors.
About the Dataset
We use the Ames Housing dataset, which contains information on 2,930 properties in Ames, Iowa, including variables that describe lot dimensions, overall quality and condition, living area, basement and garage characteristics, and the final sale price.
2. ANALYSIS
2.1. Load the library and Ames housing dataset
## Load the libraries used (duplicates removed; plyr is loaded before
## dplyr/tidyverse so that dplyr's verbs are not masked)
library(magrittr)
library(knitr)
library(plyr)
library(tidyverse) # includes dplyr, ggplot2, readr, and tibble
library(readxl)
library(gridExtra)
library(RColorBrewer)
library(lattice)
library(corrplot)
library(summarytools)
library(DT)
library(kableExtra)
library(DescTools)
library(qcc)
library(agricolae)
library(car)
library(psych)
library(gtools)
library(ggfortify)
library(GGally)
library(modelr)
library(scales)
library(lmtest)
library(olsrr)
library(leaps)
library(sjPlot)
library(performance)
library(see)
# Load the data
Ames <- read_csv("~/Desktop/Intro To Analytics - ALY 6000/ALY 6000 - Project/Data Sets/AmesHousing.csv")
# Disabling scientific notation, so my graphs and outputs will be more readable:
options(scipen = 100)
2.2. Perform Exploratory Data Analysis and use descriptive statistics to describe the data.
Exploratory Data Analysis (EDA) is performed in order to better understand the underlying structure of the data, and to identify patterns, relationships, and outliers in the data set. It is an initial step in the data analysis process that helps to inform the decisions that will be made later in the analysis, such as which statistical models to use and which features to include in the models. Additionally, EDA can help to identify any issues or problems with the data, such as missing values or outliers, so that they can be addressed before modeling begins.
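As a minimal, self-contained sketch of the kind of descriptive summary this step produces (the price vector below is hypothetical, not taken from the Ames data):

```r
# A small hypothetical price vector standing in for SalePrice
prices <- c(105000, 129900, 144000, 160000, 189000, 213500, 755000)

price_summary <- c(
  mean   = mean(prices),
  median = median(prices),
  min    = min(prices),
  max    = max(prices)
)

# A mean well above the median suggests a right-skewed distribution,
# which is the pattern Graph 1 shows for SalePrice.
print(round(price_summary))
```

In practice the same summaries are computed on the real `SalePrice` column; the point here is only that comparing mean and median is a quick skewness check.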
# 2. Perform Exploratory Data Analysis and use descriptive statistics to describe the data.
###########################################################################
# Histogram of prices
ggplot(Ames, aes(x = SalePrice)) +
geom_histogram(color = "black", fill = "#ed610b", bins = 50) +
labs(title = "Graph 1: Distribution of house prices", x = "Price", y = "Frequency") +
theme_minimal()
barplot(table(Ames$"Yr Sold"),
main = "Graph 2: When were the most houses Sold?",
xlab = "Year",
ylab = "Number of houses",
col = brewer.pal(9, "Blues")
)
barplot(table(Ames$"Overall Qual"),
main = "Graph 3: What quality rating do most houses on the market have?",
xlab = "Overall quality rating",
ylab = "Number of houses",
col = brewer.pal(10, "RdYlBu")
)
# Histogram of Living area
ggplot(Ames, aes(x = `Gr Liv Area`)) +
geom_histogram(color = "black", fill = "#2c0ce6", bins = 30) +
scale_x_continuous(labels = comma) +
labs(title = "Graph 4: Distribution of Living Area", x = "Living area (sqft)", y = "Frequency") +
theme_minimal()
# Let's see median prices per neighborhood
neighbourhoods <- tapply(Ames$SalePrice, Ames$Neighborhood, median)
neighbourhoods <- sort(neighbourhoods, decreasing = TRUE)
dotchart(neighbourhoods,
pch = 21, bg = "purple1",
cex = 0.85,
xlab = "Median price of a house",
main = "Graph 5: Which neighborhood is the most expensive to buy a house in?"
)
Observations
From Graph 1 we notice that house prices are right-skewed, with the majority priced below $200,000. Prices range from $12,789 to $755,000, with a mean of $180,796 and a median of $160,000.
From Graph 2, we can see that the most houses were sold in 2007, followed by a sharp drop in 2008, coinciding with the subprime mortgage crisis.
From Graph 3, it appears that the majority of houses on the market are of average quality, with more well-maintained houses than below-average ones.
From Graph 4, it appears that the majority of houses have less than 2,000 sqft of living area; the mean is about 1,500 sqft and the median is 1,442 sqft.
In Graph 5, I chose the median instead of the mean because it is less affected by outliers, such as a single house with an extremely high value. The graph illustrates that neighborhood location plays a significant role in determining house prices, with the most expensive areas roughly three times pricier than the least expensive ones. StoneBr is the most expensive neighborhood in which to buy a house in Ames.
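The median-versus-mean choice behind Graph 5 can be illustrated with a toy example (the prices below are made up):

```r
# One extreme sale shifts the mean far more than the median
prices <- c(140000, 150000, 155000, 160000, 165000)
with_outlier <- c(prices, 900000)

mean(prices); mean(with_outlier)      # mean jumps by over $100,000
median(prices); median(with_outlier)  # median barely moves
```

This robustness is why medians per neighborhood give a fairer comparison than means.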
2.3. Prepare the dataset for modeling by imputing missing values with the variable’s mean value
In this section, we clean the dataset for modeling by imputing the missing values in the “Mas Vnr Area” variable with that variable’s mean.
# First, check the number of missing values in each column
na_count <- sapply(Ames, function(col) sum(is.na(col)))
na_count
## Order PID MS SubClass MS Zoning Lot Frontage
## 0 0 0 0 490
## Lot Area Street Alley Lot Shape Land Contour
## 0 0 2732 0 0
## Utilities Lot Config Land Slope Neighborhood Condition 1
## 0 0 0 0 0
## Condition 2 Bldg Type House Style Overall Qual Overall Cond
## 0 0 0 0 0
## Year Built Year Remod/Add Roof Style Roof Matl Exterior 1st
## 0 0 0 0 0
## Exterior 2nd Mas Vnr Type Mas Vnr Area Exter Qual Exter Cond
## 0 23 23 0 0
## Foundation Bsmt Qual Bsmt Cond Bsmt Exposure BsmtFin Type 1
## 0 80 80 83 80
## BsmtFin SF 1 BsmtFin Type 2 BsmtFin SF 2 Bsmt Unf SF Total Bsmt SF
## 1 81 1 1 1
## Heating Heating QC Central Air Electrical 1st Flr SF
## 0 0 0 1 0
## 2nd Flr SF Low Qual Fin SF Gr Liv Area Bsmt Full Bath Bsmt Half Bath
## 0 0 0 2 2
## Full Bath Half Bath Bedroom AbvGr Kitchen AbvGr Kitchen Qual
## 0 0 0 0 0
## TotRms AbvGrd Functional Fireplaces Fireplace Qu Garage Type
## 0 0 0 1422 157
## Garage Yr Blt Garage Finish Garage Cars Garage Area Garage Qual
## 159 159 1 1 159
## Garage Cond Paved Drive Wood Deck SF Open Porch SF Enclosed Porch
## 159 0 0 0 0
## 3Ssn Porch Screen Porch Pool Area Pool QC Fence
## 0 0 0 2917 2358
## Misc Feature Misc Val Mo Sold Yr Sold Sale Type
## 2824 0 0 0 0
## Sale Condition SalePrice
## 0 0
# 3. Imputation of Mean Value in "Mas Vnr Area" Variable
#################################################################
Ames$"Mas Vnr Area"[is.na(Ames$"Mas Vnr Area")] <- mean(Ames$"Mas Vnr Area", na.rm = TRUE)
2.4. Use the “cor()” function to produce a correlation matrix of the numeric values.
In this section, we use the “cor()” function to produce a correlation matrix of the numeric variables, plot the matrix, and explain how to interpret it.
A correlation matrix is a table showing the correlation coefficients between multiple variables. It is an important tool in identifying which variables are related to each other and the strength of the relationship.
The correlation coefficient ranges from -1 to 1, with -1 indicating a perfect negative correlation, 0 indicating no correlation and 1 indicating a perfect positive correlation.
A correlation matrix is important in several ways:
Identifying multicollinearity: when two or more independent variables are highly correlated, statistical models such as linear regression can become unstable.
Identifying patterns in data: it shows which variables are related and which are not, which is useful for feature selection and modeling.
Spotting data problems: unexpectedly large or small coefficients can hint at issues such as outliers, since extreme values can distort the correlation coefficient.
Identifying potential confounding variables: it can help flag potential confounders in observational studies.
Identifying which variables to include in a model: by showing the relationships between variables, it helps determine which ones belong in a model.
Overall, a correlation matrix is an important exploratory data analysis tool for understanding the relationships between the variables in a dataset.
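As a self-contained illustration, cor() can be run on the built-in mtcars data, which stands in here for the Ames numeric subset:

```r
# Pearson correlation matrix for a few numeric columns of mtcars
m <- cor(mtcars[, c("mpg", "wt", "hp", "disp")], method = "pearson")
round(m, 2)
```

The result has the properties described above: ones on the diagonal, symmetry, and every entry between -1 and 1.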
# 4. Use the "cor()" function to produce a correlation matrix of the numeric values.
###################################################################################
# Creating data subset without character variables
data.only.numeric <- Ames[, !sapply(Ames, is.character)]
only.numeric.noNA <- na.omit(data.only.numeric)
correlation.matrix <- cor(only.numeric.noNA, method = "pearson")
# Rounding off the digits in Table
table2 <- round((correlation.matrix), digits = 2)
# Present the table using kableExta Package
knitr::kable(table2,
caption = "Table 2: Correlation Matrix of the Numeric Variables in the
Ames Housing Dataset",
format = "html",
table.attr = "style=width: 40%",
font_size = 8
) %>%
kable_styling(bootstrap_options = c(
"striped", "hover",
"condensed", "responsive"
)) %>%
kable_classic(
full_width = F,
html_font = "Times New Roman"
)
| Order | Lot Frontage | Lot Area | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Mas Vnr Area | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | TotRms AbvGrd | Fireplaces | Garage Yr Blt | Garage Cars | Garage Area | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | Mo Sold | Yr Sold | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Order | 1.00 | 0.00 | 0.03 | -0.06 | 0.00 | -0.07 | -0.08 | -0.03 | -0.03 | -0.02 | 0.00 | -0.04 | -0.02 | 0.02 | -0.01 | 0.00 | -0.04 | 0.02 | -0.05 | -0.02 | 0.03 | -0.01 | 0.02 | -0.01 | -0.06 | -0.04 | -0.04 | -0.01 | 0.02 | 0.03 | -0.02 | 0.01 | 0.05 | -0.01 | 0.14 | -0.98 | -0.04 |
| Lot Frontage | 0.00 | 1.00 | 0.49 | 0.20 | -0.07 | 0.11 | 0.09 | 0.22 | 0.22 | 0.04 | 0.11 | 0.35 | 0.45 | 0.03 | -0.01 | 0.38 | 0.11 | -0.02 | 0.17 | 0.04 | 0.25 | 0.00 | 0.36 | 0.25 | 0.08 | 0.32 | 0.37 | 0.12 | 0.17 | 0.02 | 0.03 | 0.07 | 0.18 | 0.05 | 0.01 | -0.01 | 0.35 |
| Lot Area | 0.03 | 0.49 | 1.00 | 0.14 | -0.06 | 0.05 | 0.05 | 0.14 | 0.22 | 0.10 | 0.03 | 0.29 | 0.36 | 0.05 | 0.01 | 0.33 | 0.15 | -0.02 | 0.14 | 0.06 | 0.16 | -0.02 | 0.27 | 0.24 | 0.04 | 0.22 | 0.26 | 0.16 | 0.12 | 0.02 | 0.01 | 0.08 | 0.13 | 0.08 | 0.01 | -0.02 | 0.31 |
| Overall Qual | -0.06 | 0.20 | 0.14 | 1.00 | -0.17 | 0.61 | 0.58 | 0.44 | 0.29 | -0.06 | 0.30 | 0.57 | 0.52 | 0.21 | -0.04 | 0.59 | 0.18 | -0.05 | 0.56 | 0.24 | 0.06 | -0.14 | 0.41 | 0.39 | 0.58 | 0.60 | 0.56 | 0.27 | 0.33 | -0.16 | 0.00 | 0.03 | 0.03 | 0.02 | 0.03 | -0.01 | 0.80 |
| Overall Cond | 0.00 | -0.07 | -0.06 | -0.17 | 1.00 | -0.44 | -0.01 | -0.17 | -0.08 | 0.05 | -0.15 | -0.21 | -0.20 | 0.00 | 0.02 | -0.16 | -0.05 | 0.09 | -0.26 | -0.12 | -0.01 | -0.08 | -0.12 | -0.04 | -0.35 | -0.28 | -0.25 | -0.01 | -0.11 | 0.09 | 0.01 | 0.05 | -0.03 | 0.02 | -0.01 | 0.03 | -0.17 |
| Year Built | -0.07 | 0.11 | 0.05 | 0.61 | -0.44 | 1.00 | 0.64 | 0.33 | 0.27 | -0.04 | 0.17 | 0.43 | 0.34 | 0.00 | -0.13 | 0.25 | 0.21 | -0.04 | 0.51 | 0.25 | -0.05 | -0.12 | 0.14 | 0.15 | 0.83 | 0.54 | 0.48 | 0.23 | 0.24 | -0.38 | 0.02 | -0.06 | 0.00 | -0.01 | 0.02 | 0.00 | 0.56 |
| Year Remod/Add | -0.08 | 0.09 | 0.05 | 0.58 | -0.01 | 0.64 | 1.00 | 0.21 | 0.14 | -0.06 | 0.19 | 0.32 | 0.28 | 0.13 | -0.06 | 0.32 | 0.13 | -0.06 | 0.49 | 0.18 | -0.04 | -0.15 | 0.21 | 0.14 | 0.65 | 0.47 | 0.41 | 0.23 | 0.28 | -0.23 | 0.02 | -0.05 | -0.01 | 0.00 | 0.03 | 0.04 | 0.54 |
| Mas Vnr Area | -0.03 | 0.22 | 0.14 | 0.44 | -0.17 | 0.33 | 0.21 | 1.00 | 0.32 | -0.04 | 0.10 | 0.42 | 0.43 | 0.12 | -0.05 | 0.43 | 0.16 | -0.01 | 0.28 | 0.18 | 0.10 | -0.02 | 0.32 | 0.28 | 0.27 | 0.37 | 0.39 | 0.18 | 0.14 | -0.13 | 0.01 | 0.06 | 0.01 | 0.07 | 0.01 | -0.02 | 0.53 |
| BsmtFin SF 1 | -0.03 | 0.22 | 0.22 | 0.29 | -0.08 | 0.27 | 0.14 | 0.32 | 1.00 | -0.05 | -0.48 | 0.55 | 0.49 | -0.17 | -0.06 | 0.24 | 0.64 | 0.07 | 0.07 | 0.00 | -0.12 | -0.06 | 0.09 | 0.30 | 0.20 | 0.23 | 0.30 | 0.22 | 0.15 | -0.10 | 0.04 | 0.09 | 0.11 | 0.13 | 0.01 | 0.02 | 0.44 |
| BsmtFin SF 2 | -0.02 | 0.04 | 0.10 | -0.06 | 0.05 | -0.04 | -0.06 | -0.04 | -0.05 | 1.00 | -0.25 | 0.07 | 0.05 | -0.10 | -0.01 | -0.04 | 0.17 | 0.11 | -0.08 | -0.03 | -0.02 | -0.02 | -0.07 | 0.04 | -0.06 | -0.06 | -0.03 | 0.08 | -0.01 | 0.02 | -0.03 | 0.05 | 0.06 | 0.00 | -0.01 | 0.02 | -0.02 |
| Bsmt Unf SF | 0.00 | 0.11 | 0.03 | 0.30 | -0.15 | 0.17 | 0.19 | 0.10 | -0.48 | -0.25 | 1.00 | 0.40 | 0.30 | 0.00 | 0.03 | 0.24 | -0.41 | -0.11 | 0.30 | -0.04 | 0.16 | 0.03 | 0.23 | 0.01 | 0.20 | 0.25 | 0.22 | -0.02 | 0.13 | -0.01 | -0.01 | -0.04 | -0.04 | -0.01 | 0.01 | -0.04 | 0.20 |
| Total Bsmt SF | -0.04 | 0.35 | 0.29 | 0.57 | -0.21 | 0.43 | 0.32 | 0.42 | 0.55 | 0.07 | 0.40 | 1.00 | 0.83 | -0.21 | -0.03 | 0.47 | 0.33 | 0.01 | 0.35 | -0.06 | 0.03 | -0.05 | 0.30 | 0.33 | 0.38 | 0.47 | 0.52 | 0.24 | 0.28 | -0.11 | 0.02 | 0.07 | 0.09 | 0.13 | 0.02 | -0.01 | 0.65 |
| 1st Flr SF | -0.02 | 0.45 | 0.36 | 0.52 | -0.20 | 0.34 | 0.28 | 0.43 | 0.49 | 0.05 | 0.30 | 0.83 | 1.00 | -0.26 | -0.01 | 0.57 | 0.27 | 0.01 | 0.37 | -0.11 | 0.07 | 0.07 | 0.39 | 0.40 | 0.31 | 0.48 | 0.53 | 0.24 | 0.27 | -0.09 | 0.02 | 0.10 | 0.14 | 0.14 | 0.04 | -0.01 | 0.64 |
| 2nd Flr SF | 0.02 | 0.03 | 0.05 | 0.21 | 0.00 | 0.00 | 0.13 | 0.12 | -0.17 | -0.10 | 0.00 | -0.21 | -0.26 | 1.00 | 0.01 | 0.64 | -0.17 | -0.06 | 0.38 | 0.60 | 0.51 | 0.05 | 0.58 | 0.16 | 0.05 | 0.18 | 0.12 | 0.09 | 0.17 | 0.07 | -0.03 | 0.01 | 0.05 | -0.02 | 0.01 | -0.04 | 0.25 |
| Low Qual Fin SF | -0.01 | -0.01 | 0.01 | -0.04 | 0.02 | -0.13 | -0.06 | -0.05 | -0.06 | -0.01 | 0.03 | -0.03 | -0.01 | 0.01 | 1.00 | 0.09 | -0.04 | -0.01 | 0.00 | -0.03 | 0.06 | -0.02 | 0.08 | 0.01 | -0.06 | -0.02 | -0.01 | -0.01 | 0.01 | 0.10 | 0.00 | 0.02 | 0.05 | 0.00 | 0.01 | 0.02 | -0.03 |
| Gr Liv Area | 0.00 | 0.38 | 0.33 | 0.59 | -0.16 | 0.25 | 0.32 | 0.43 | 0.24 | -0.04 | 0.24 | 0.47 | 0.57 | 0.64 | 0.09 | 1.00 | 0.07 | -0.05 | 0.62 | 0.42 | 0.50 | 0.09 | 0.81 | 0.46 | 0.28 | 0.53 | 0.52 | 0.26 | 0.36 | -0.01 | -0.01 | 0.08 | 0.15 | 0.09 | 0.04 | -0.04 | 0.71 |
| Bsmt Full Bath | -0.04 | 0.11 | 0.15 | 0.18 | -0.05 | 0.21 | 0.13 | 0.16 | 0.64 | 0.17 | -0.41 | 0.33 | 0.27 | -0.17 | -0.04 | 0.07 | 1.00 | -0.13 | -0.04 | -0.03 | -0.17 | -0.02 | -0.02 | 0.17 | 0.15 | 0.16 | 0.20 | 0.16 | 0.08 | -0.07 | 0.02 | 0.05 | 0.06 | 0.01 | 0.01 | 0.04 | 0.28 |
| Bsmt Half Bath | 0.02 | -0.02 | -0.02 | -0.05 | 0.09 | -0.04 | -0.06 | -0.01 | 0.07 | 0.11 | -0.11 | 0.01 | 0.01 | -0.06 | -0.01 | -0.05 | -0.13 | 1.00 | -0.07 | -0.05 | 0.00 | -0.03 | -0.06 | 0.04 | -0.06 | -0.05 | -0.04 | 0.08 | -0.05 | 0.00 | 0.04 | 0.02 | 0.09 | 0.05 | 0.01 | -0.01 | -0.05 |
| Full Bath | -0.05 | 0.17 | 0.14 | 0.56 | -0.26 | 0.51 | 0.49 | 0.28 | 0.07 | -0.08 | 0.30 | 0.35 | 0.37 | 0.38 | 0.00 | 0.62 | -0.04 | -0.07 | 1.00 | 0.14 | 0.33 | 0.12 | 0.52 | 0.23 | 0.51 | 0.53 | 0.45 | 0.18 | 0.29 | -0.15 | 0.00 | -0.01 | 0.03 | -0.01 | 0.05 | -0.01 | 0.56 |
| Half Bath | -0.02 | 0.04 | 0.06 | 0.24 | -0.12 | 0.25 | 0.18 | 0.18 | 0.00 | -0.03 | -0.04 | -0.06 | -0.11 | 0.60 | -0.03 | 0.42 | -0.03 | -0.05 | 0.14 | 1.00 | 0.25 | -0.05 | 0.36 | 0.19 | 0.21 | 0.22 | 0.15 | 0.11 | 0.18 | -0.07 | -0.02 | 0.03 | 0.01 | 0.03 | -0.01 | -0.01 | 0.27 |
| Bedroom AbvGr | 0.03 | 0.25 | 0.16 | 0.06 | -0.01 | -0.05 | -0.04 | 0.10 | -0.12 | -0.02 | 0.16 | 0.03 | 0.07 | 0.51 | 0.06 | 0.50 | -0.17 | 0.00 | 0.33 | 0.25 | 1.00 | 0.19 | 0.65 | 0.09 | -0.05 | 0.13 | 0.10 | 0.04 | 0.06 | 0.05 | -0.05 | 0.02 | 0.04 | 0.00 | 0.04 | -0.04 | 0.14 |
| Kitchen AbvGr | -0.01 | 0.00 | -0.02 | -0.14 | -0.08 | -0.12 | -0.15 | -0.02 | -0.06 | -0.02 | 0.03 | -0.05 | 0.07 | 0.05 | -0.02 | 0.09 | -0.02 | -0.03 | 0.12 | -0.05 | 0.19 | 1.00 | 0.25 | -0.09 | -0.10 | 0.08 | 0.04 | -0.09 | -0.07 | 0.00 | -0.02 | -0.05 | -0.01 | -0.01 | 0.03 | 0.03 | -0.11 |
| TotRms AbvGrd | 0.02 | 0.36 | 0.27 | 0.41 | -0.12 | 0.14 | 0.21 | 0.32 | 0.09 | -0.07 | 0.23 | 0.30 | 0.39 | 0.58 | 0.08 | 0.81 | -0.02 | -0.06 | 0.52 | 0.36 | 0.65 | 0.25 | 1.00 | 0.33 | 0.17 | 0.43 | 0.39 | 0.18 | 0.24 | 0.00 | -0.04 | 0.03 | 0.08 | 0.07 | 0.04 | -0.05 | 0.52 |
| Fireplaces | -0.01 | 0.25 | 0.24 | 0.39 | -0.04 | 0.15 | 0.14 | 0.28 | 0.30 | 0.04 | 0.01 | 0.33 | 0.40 | 0.16 | 0.01 | 0.46 | 0.17 | 0.04 | 0.23 | 0.19 | 0.09 | -0.09 | 0.33 | 1.00 | 0.10 | 0.27 | 0.24 | 0.22 | 0.17 | -0.01 | 0.01 | 0.17 | 0.12 | 0.02 | 0.02 | -0.01 | 0.46 |
| Garage Yr Blt | -0.06 | 0.08 | 0.04 | 0.58 | -0.35 | 0.83 | 0.65 | 0.27 | 0.20 | -0.06 | 0.20 | 0.38 | 0.31 | 0.05 | -0.06 | 0.28 | 0.15 | -0.06 | 0.51 | 0.21 | -0.05 | -0.10 | 0.17 | 0.10 | 1.00 | 0.60 | 0.58 | 0.24 | 0.25 | -0.31 | 0.02 | -0.06 | -0.01 | 0.00 | 0.03 | 0.00 | 0.54 |
| Garage Cars | -0.04 | 0.32 | 0.22 | 0.60 | -0.28 | 0.54 | 0.47 | 0.37 | 0.23 | -0.06 | 0.25 | 0.47 | 0.48 | 0.18 | -0.02 | 0.53 | 0.16 | -0.05 | 0.53 | 0.22 | 0.13 | 0.08 | 0.43 | 0.27 | 0.60 | 1.00 | 0.85 | 0.23 | 0.25 | -0.14 | 0.01 | 0.01 | 0.03 | -0.01 | 0.06 | -0.02 | 0.66 |
| Garage Area | -0.04 | 0.37 | 0.26 | 0.56 | -0.25 | 0.48 | 0.41 | 0.39 | 0.30 | -0.03 | 0.22 | 0.52 | 0.53 | 0.12 | -0.01 | 0.52 | 0.20 | -0.04 | 0.45 | 0.15 | 0.10 | 0.04 | 0.39 | 0.24 | 0.58 | 0.85 | 1.00 | 0.23 | 0.29 | -0.10 | 0.01 | 0.04 | 0.06 | 0.03 | 0.04 | -0.01 | 0.65 |
| Wood Deck SF | -0.01 | 0.12 | 0.16 | 0.27 | -0.01 | 0.23 | 0.23 | 0.18 | 0.22 | 0.08 | -0.02 | 0.24 | 0.24 | 0.09 | -0.01 | 0.26 | 0.16 | 0.08 | 0.18 | 0.11 | 0.04 | -0.09 | 0.18 | 0.22 | 0.24 | 0.23 | 0.23 | 1.00 | 0.05 | -0.11 | -0.04 | -0.06 | 0.09 | 0.09 | 0.02 | -0.01 | 0.33 |
| Open Porch SF | 0.02 | 0.17 | 0.12 | 0.33 | -0.11 | 0.24 | 0.28 | 0.14 | 0.15 | -0.01 | 0.13 | 0.28 | 0.27 | 0.17 | 0.01 | 0.36 | 0.08 | -0.05 | 0.29 | 0.18 | 0.06 | -0.07 | 0.24 | 0.17 | 0.25 | 0.25 | 0.29 | 0.05 | 1.00 | -0.08 | -0.01 | 0.07 | 0.06 | 0.11 | 0.05 | -0.04 | 0.34 |
| Enclosed Porch | 0.03 | 0.02 | 0.02 | -0.16 | 0.09 | -0.38 | -0.23 | -0.13 | -0.10 | 0.02 | -0.01 | -0.11 | -0.09 | 0.07 | 0.10 | -0.01 | -0.07 | 0.00 | -0.15 | -0.07 | 0.05 | 0.00 | 0.00 | -0.01 | -0.31 | -0.14 | -0.10 | -0.11 | -0.08 | 1.00 | -0.03 | -0.06 | 0.12 | 0.00 | -0.03 | 0.00 | -0.14 |
| 3Ssn Porch | -0.02 | 0.03 | 0.01 | 0.00 | 0.01 | 0.02 | 0.02 | 0.01 | 0.04 | -0.03 | -0.01 | 0.02 | 0.02 | -0.03 | 0.00 | -0.01 | 0.02 | 0.04 | 0.00 | -0.02 | -0.05 | -0.02 | -0.04 | 0.01 | 0.02 | 0.01 | 0.01 | -0.04 | -0.01 | -0.03 | 1.00 | -0.03 | -0.01 | 0.00 | 0.03 | 0.02 | 0.01 |
| Screen Porch | 0.01 | 0.07 | 0.08 | 0.03 | 0.05 | -0.06 | -0.05 | 0.06 | 0.09 | 0.05 | -0.04 | 0.07 | 0.10 | 0.01 | 0.02 | 0.08 | 0.05 | 0.02 | -0.01 | 0.03 | 0.02 | -0.05 | 0.03 | 0.17 | -0.06 | 0.01 | 0.04 | -0.06 | 0.07 | -0.06 | -0.03 | 1.00 | 0.03 | 0.02 | 0.03 | -0.02 | 0.11 |
| Pool Area | 0.05 | 0.18 | 0.13 | 0.03 | -0.03 | 0.00 | -0.01 | 0.01 | 0.11 | 0.06 | -0.04 | 0.09 | 0.14 | 0.05 | 0.05 | 0.15 | 0.06 | 0.09 | 0.03 | 0.01 | 0.04 | -0.01 | 0.08 | 0.12 | -0.01 | 0.03 | 0.06 | 0.09 | 0.06 | 0.12 | -0.01 | 0.03 | 1.00 | 0.02 | -0.06 | -0.05 | 0.07 |
| Misc Val | -0.01 | 0.05 | 0.08 | 0.02 | 0.02 | -0.01 | 0.00 | 0.07 | 0.13 | 0.00 | -0.01 | 0.13 | 0.14 | -0.02 | 0.00 | 0.09 | 0.01 | 0.05 | -0.01 | 0.03 | 0.00 | -0.01 | 0.07 | 0.02 | 0.00 | -0.01 | 0.03 | 0.09 | 0.11 | 0.00 | 0.00 | 0.02 | 0.02 | 1.00 | 0.02 | 0.01 | -0.01 |
| Mo Sold | 0.14 | 0.01 | 0.01 | 0.03 | -0.01 | 0.02 | 0.03 | 0.01 | 0.01 | -0.01 | 0.01 | 0.02 | 0.04 | 0.01 | 0.01 | 0.04 | 0.01 | 0.01 | 0.05 | -0.01 | 0.04 | 0.03 | 0.04 | 0.02 | 0.03 | 0.06 | 0.04 | 0.02 | 0.05 | -0.03 | 0.03 | 0.03 | -0.06 | 0.02 | 1.00 | -0.17 | 0.04 |
| Yr Sold | -0.98 | -0.01 | -0.02 | -0.01 | 0.03 | 0.00 | 0.04 | -0.02 | 0.02 | 0.02 | -0.04 | -0.01 | -0.01 | -0.04 | 0.02 | -0.04 | 0.04 | -0.01 | -0.01 | -0.01 | -0.04 | 0.03 | -0.05 | -0.01 | 0.00 | -0.02 | -0.01 | -0.01 | -0.04 | 0.00 | 0.02 | -0.02 | -0.05 | 0.01 | -0.17 | 1.00 | -0.03 |
| SalePrice | -0.04 | 0.35 | 0.31 | 0.80 | -0.17 | 0.56 | 0.54 | 0.53 | 0.44 | -0.02 | 0.20 | 0.65 | 0.64 | 0.25 | -0.03 | 0.71 | 0.28 | -0.05 | 0.56 | 0.27 | 0.14 | -0.11 | 0.52 | 0.46 | 0.54 | 0.66 | 0.65 | 0.33 | 0.34 | -0.14 | 0.01 | 0.11 | 0.07 | -0.01 | 0.04 | -0.03 | 1.00 |
2.5. Produce a plot of the correlation matrix, and explain how to interpret it.
Interpreting a correlation matrix is fairly simple: the matrix shows the correlation coefficients between pairs of variables in tabular form.
The diagonal elements of the matrix are always 1, as a variable is always perfectly correlated with itself.
The correlation coefficient ranges from -1 to 1.
A coefficient of 1 indicates a perfect positive correlation, which means that as one variable increases, the other variable also increases.
A coefficient of -1 indicates a perfect negative correlation, which means that as one variable increases, the other variable decreases.
A coefficient of 0 indicates no correlation, which means that the variables are independent of each other.
Values close to 1 or -1 indicate a strong correlation, while values close to 0 indicate a weak correlation.
Variables whose pairwise correlation exceeds a chosen threshold (commonly 0.7 or 0.8) are highly correlated and are potential sources of multicollinearity.
Variables whose correlation with the response is close to 0 carry little linear information about it and may add little to a linear model.
Variables strongly correlated with the response (coefficients close to 1 or -1) are promising candidates for the model.
In summary, a correlation matrix is a useful tool for understanding relationships between variables: it helps identify correlated variables, avoid multicollinearity, and select features for a predictive model.
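The threshold rule above can be applied programmatically. This sketch flags variable pairs with |r| > 0.7 in the built-in mtcars data, which stands in for the Ames variables:

```r
# Flag pairs of columns whose absolute correlation exceeds 0.7
m <- cor(mtcars[, c("mpg", "wt", "hp", "disp", "qsec")])
high <- which(abs(m) > 0.7 & upper.tri(m), arr.ind = TRUE)
data.frame(
  var1 = rownames(m)[high[, "row"]],
  var2 = colnames(m)[high[, "col"]],
  r    = round(m[high], 2)
)
```

On the Ames correlation matrix, the same code would surface pairs such as Garage Cars / Garage Area (r = 0.85 in the table above).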
# 5. Produce a plot of the correlation matrix, and explain how to interpret it.
###################################################################################
# correlation.matrix is already a correlation matrix, so it is passed
# to corrplot() directly (wrapping it in cor() again would be a bug)
corrplot::corrplot(correlation.matrix, tl.cex = 0.5)
2.6. Make a scatter plot for the X continuous variable with the highest correlation with SalePrice.
In this section, we make a scatter plot for the continuous variable with the highest correlation with SalePrice (Overall Qual, r = 0.80), do the same for the variable with the lowest correlation (Misc Val, r = -0.01), and finally plot the variable whose correlation with SalePrice is closest to 0.5 (TotRms AbvGrd, r = 0.52). We then interpret the scatter plots and describe how the patterns differ.
# 6. Make a scatter plot for the X continuous variable with the highest correlation with
# SalePrice. Do the same for the X variable that has the lowest correlation with SalePrice.
# Finally, make a scatter plot between X and SalePrice with the correlation closest to 0.5. Interpret the scatter plots and describe how the patterns differ.
# Variable with highest correlation with SalePrice
################################################
# Creating Objects for Analysis
YSalePrice <- c(Ames$SalePrice)
XOverallQuality <- c(Ames$`Overall Qual`)
# Using the linear Regression Formula
linearReg2.6 <- lm(YSalePrice ~ XOverallQuality)
# Creating an object to store the summary of the linear regression
SumData2.6 <- summary(linearReg2.6)
# Extracting Values and Creating Object to store the value of Intercept and Slope
Intercept2.6 <- SumData2.6$coefficients[[1]]
Slope2.6 <- SumData2.6$coefficients[[2]]
# Plotting the Scatter Plot
plot(
YSalePrice ~ XOverallQuality,
pch = 19,
col = "blue",
xlab = "Overall Quality",
ylab = "Sales Price",
main = "Plot 1: Linear Regression: Overall Quality and Sale Price "
)
# Adding Lines and Text in Scatter Plot
abline(linearReg2.6, col = "#99004C", lty = 2, lwd = 2) # Adding the Regression Line
abline(v = 0, lwd = 2)
abline(h = 0, lwd = 2)
# Variable with lowest correlation with SalePrice
################################################
# Creating Objects for Analysis
YSalePrice <- c(Ames$SalePrice)
XMiscVal <- c(Ames$`Misc Val`)
# Using the linear Regression Formula
linearReg2.7 <- lm(YSalePrice ~ XMiscVal)
# Creating an object to store the summary of the linear regression
SumData2.7 <- summary(linearReg2.7)
# Extracting the intercept and slope
Intercept2.7 <- SumData2.7$coefficients[[1]]
Slope2.7 <- SumData2.7$coefficients[[2]]
# Plotting the Scatter Plot
plot(
YSalePrice ~ XMiscVal,
pch = 19,
col = "#ff6600",
xlab = "Miscellaneous feature",
ylab = "Sales Price",
main = "Plot 2: Linear Regression: Miscellaneous feature and Sale Price "
)
# Adding Lines and Text in Scatter Plot
abline(linearReg2.7, col = "#99004C", lty = 2, lwd = 2) # Adding the Regression Line
abline(v = 0, lwd = 2)
abline(h = 0, lwd = 2)
# 7. Variable with a correlation closest to 0.5
###############################################
# Creating Objects for Analysis
YSalePrice <- c(Ames$SalePrice)
XTotRmsAbvGrd <- c(Ames$`TotRms AbvGrd`)
# Using the linear Regression Formula
linearReg2.8 <- lm(YSalePrice ~ XTotRmsAbvGrd)
# Creating an object to store the summary of the linear regression
SumData2.8 <- summary(linearReg2.8)
# Extracting the intercept and slope
Intercept2.8 <- SumData2.8$coefficients[[1]]
Slope2.8 <- SumData2.8$coefficients[[2]]
# Plotting the Scatter Plot
plot(
YSalePrice ~ XTotRmsAbvGrd,
pch = 19,
col = "#f6055d",
xlab = "Total rooms above grade",
ylab = "Sales Price",
main = "Plot 3: Linear Regression: Total rooms above grade and Sale Price "
)
# Adding Lines and Text in Scatter Plot
abline(linearReg2.8, col = "#99004C", lty = 2, lwd = 2) # Adding the Regression Line
abline(v = 0, lwd = 2)
abline(h = 0, lwd = 2)
Observations
From Plots 1 and 3, it can be clearly seen that sale price increases with both overall quality and the number of rooms above grade; the dashed line shows the fitted linear relationship in each case. Some unusual observations are also visible in both plots.
From Plot 2, Misc Val and SalePrice show essentially no linear relationship (r = -0.01): the regression line is nearly flat and most houses cluster at a miscellaneous value of zero.
2.7. Using at least 3 continuous variables, fit a regression model in R.
# 7. Using at least 3 continuous variables, fit a regression model in R.
# Creating regression model
attach(only.numeric.noNA)
Table_regression <- lm(SalePrice ~ `Garage Area` + `Gr Liv Area` + `Total Bsmt SF`)
tab_model(Table_regression)
| SalePrice | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | -41364.17 | -48085.46 – -34642.88 | <0.001 |
| Garage Area | 114.57 | 101.94 – 127.20 | <0.001 |
| Gr Liv Area | 72.24 | 67.53 – 76.96 | <0.001 |
| Total Bsmt SF | 56.50 | 51.20 – 61.81 | <0.001 |
| Observations | 2290 | ||
| R2 / R2 adjusted | 0.678 / 0.677 | ||
2.8. Report the model in equation form and interpret each coefficient of the model in the context of this problem.
summary(Table_regression)
##
## Call:
## lm(formula = SalePrice ~ `Garage Area` + `Gr Liv Area` + `Total Bsmt SF`)
##
## Residuals:
## Min 1Q Median 3Q Max
## -713929 -19690 748 19829 256637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41364.167 3427.478 -12.07 <0.0000000000000002 ***
## `Garage Area` 114.570 6.442 17.79 <0.0000000000000002 ***
## `Gr Liv Area` 72.245 2.405 30.04 <0.0000000000000002 ***
## `Total Bsmt SF` 56.502 2.705 20.89 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47430 on 2286 degrees of freedom
## Multiple R-squared: 0.6776, Adjusted R-squared: 0.6772
## F-statistic: 1601 on 3 and 2286 DF, p-value: < 0.00000000000000022
The equation representing my multiple linear regression is as follows:
y = -41364.17 + 114.57 * Garage Area + 72.24 * Gr Liv Area + 56.50 * Total Bsmt SF
Holding the other predictors constant, each additional square foot of garage area is associated with a $114.57 increase in expected sale price, each additional square foot of above-grade living area with a $72.24 increase, and each additional square foot of basement with a $56.50 increase. The intercept (-$41,364) is the predicted price when all three areas are zero; it has no practical meaning on its own and simply anchors the regression line.
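As a quick arithmetic check, the fitted equation can be evaluated for a hypothetical house; the inputs below (500 sqft garage, 1,500 sqft living area, 1,000 sqft basement) are illustrative only:

```r
# Coefficients taken from the fitted model above
b0 <- -41364.17; b_garage <- 114.57; b_livarea <- 72.24; b_bsmt <- 56.50

# Predicted sale price for a hypothetical house
predicted <- b0 + b_garage * 500 + b_livarea * 1500 + b_bsmt * 1000
predicted  # about $180,781
```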
2.9. Use the “plot()” function to plot your regression model.
# b. Plotting regression model
par(mfrow = c(2, 2))
plot(Table_regression)
2.10. Check the model for multicollinearity and report your findings.
There are several ways to address multicollinearity if it exists in a multiple regression analysis:
Remove one or more of the correlated predictor variables. This can be done by examining the correlation matrix and removing the variable with the highest correlation with the other predictors.
Combine correlated predictor variables into a single composite variable. This can be done using factor analysis or principal component analysis.
Use ridge regression or lasso regression, which are types of regularization that can reduce the standard errors of the estimates and make the model more stable.
Use a different model altogether, such as decision trees or random forests, which are less sensitive to multicollinearity.
It’s important to note that in practice, a combination of these methods is often employed to tackle multicollinearity.
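For reference, the value reported by car::vif() can be reproduced by hand as 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The sketch below uses the built-in mtcars data rather than our Ames model:

```r
# VIF "by hand" for the predictor wt in a model with hp and disp:
# regress wt on the other predictors, then apply 1 / (1 - R^2)
r2_wt  <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
round(vif_wt, 2)
```

A VIF near 1 means a predictor is nearly independent of the others; values above about 5 (some texts use 10) are a common warning sign.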
# 10. Checking the model for multicollinearity
######################################
vif(Table_regression)
## `Garage Area` `Gr Liv Area` `Total Bsmt SF`
## 1.584664 1.476125 1.490483
Observations
All VIF values are about 1.5, well below the common threshold of 5, so multicollinearity is not a concern in this model.
However, the model fails the homoscedasticity assumption, as indicated by the non-random scattering of points in the Scale-Location plot. The points in the Normal Q-Q plot also deviate from the line, although the deviation is not extreme.
Furthermore, a few outliers or atypical observations are present in both the Residuals vs Fitted and Residuals vs Leverage plots.
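The homoscedasticity concern noted above can also be tested formally. The lmtest package loaded earlier provides bptest() for this; the base-R sketch below implements the same Breusch-Pagan idea on synthetic data whose error variance grows with x (all data here are simulated, not from the Ames model):

```r
set.seed(42)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = x)  # non-constant error variance
fit <- lm(y ~ x)

# Breusch-Pagan idea: regress squared residuals on the predictor;
# under homoscedasticity, n * R^2 is approximately chi-squared(1)
aux     <- lm(resid(fit)^2 ~ x)
bp_stat <- length(x) * summary(aux)$r.squared
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p = p_value)
```

A small p-value indicates that the residual variance depends on the predictor, i.e. heteroscedasticity.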
2.11. Looking for unusual observations or outliers
# 11. Looking for unusual observations or outliers
#############################################
outlierTest(model = Table_regression)
## rstudent unadjusted p-value
## 1168 -16.452672 0.0000000000000000000000000000000000000000000000000000000014378
## 1702 -12.552210 0.0000000000000000000000000000000000536599999999999995957515797
## 1703 -8.447971 0.0000000000000000519539999999999994968174051925162646998077632
## 39 5.456008 0.0000000539519999999999989794678982493042473933542169106658548
## 1373 5.369376 0.0000000870020000000000004485083094848962836920236441073939204
## 834 5.193804 0.0000002242700000000000090895948537048076865119128342485055327
## 1368 5.013711 0.0000005749200000000000156154538084873895087412165594287216663
## 337 4.723603 0.0000024575999999999999836566497157797073214169358834624290466
## 2016 -4.402661 0.0000111859999999999992884068891751958574332093121483922004700
## 338 4.361758 0.0000134750000000000005813708889301771876034763408824801445007
## Bonferroni p
## 1168 0.0000000000000000000000000000000000000000000000000000032926
## 1702 0.0000000000000000000000000000001228799999999999954917130272
## 1703 0.0000000000001189699999999999969265483883002692554772643241
## 39 0.0001235499999999999944620687752916410317993722856044769287
## 1373 0.0001992300000000000017100210136788973613874986767768859863
## 834 0.0005135800000000000260780286254203019780106842517852783203
## 1368 0.0013166000000000000185601534141710544645320624113082885742
## 337 0.0056278999999999999165334330086807312909513711929321289062
## 2016 0.0256149999999999988808951911778422072529792785644531250000
## 338 0.0308580000000000000126565424807267845608294010162353515625
# The function now uses its `fit` argument instead of the global model
hat.plot <- function(fit) {
p <- length(coefficients(fit))
n <- length(fitted(fit))
plot(hatvalues(fit), main = "Index Plot of Hat Values")
abline(h = c(2, 3) * p / n, col = "red", lty = 2)
identify(1:n, hatvalues(fit), names(hatvalues(fit)))
}
ols_plot_cooksd_chart(Table_regression)
par(mfrow = c(1, 1))
hat.plot(Table_regression)
## integer(0)
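Note that identify() only responds in an interactive graphics session, which is why the knitted output above shows integer(0). A non-interactive sketch that lists the high-leverage rows directly, using the same cut-offs as the plot:

```r
# List observations whose hat values exceed the conventional 2p/n and
# 3p/n leverage thresholds, instead of clicking on the plot.
high_leverage <- function(fit) {
  p <- length(coefficients(fit))
  n <- length(fitted(fit))
  h <- hatvalues(fit)
  list(over_2pn = which(h > 2 * p / n),
       over_3pn = which(h > 3 * p / n))
}
high_leverage(Table_regression)
```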
Observations
The Bonferroni outlier test flags ten observations, most notably rows 1168, 1702, and 1703, whose studentized residuals remain extreme even after adjustment (Bonferroni p-values well below 0.05). The Cook's distance chart and the hat-value plot likewise single out several influential, high-leverage points that warrant removal.
2.12 Removing unusual observations to improve model
# 12. Eliminating unusual observations to improve model
#############################################
cooksd <- cooks.distance(Table_regression)
sample_size <- nrow(data.only.numeric)
influential <- as.numeric(names(cooksd)[(cooksd > (4 / sample_size))])
only.numeric.no.outliers <- only.numeric.noNA[-influential, ]
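Before refitting, it is worth checking how much data the 4/n rule discards; a minimal sanity-check sketch using the objects defined above:

```r
# Report how many rows exceed the 4/n Cook's distance cut-off and what
# share of the sample is being removed (sample_size should match the
# data set the model was actually fitted on).
n_removed <- length(influential)
cat(n_removed, "influential rows removed (",
    round(100 * n_removed / sample_size, 1), "% of the sample )\n")
```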
# a. Looking at model now
attach(only.numeric.no.outliers)
Table_regression2 <- lm(SalePrice ~ `Garage Area` + `Gr Liv Area` + `Total Bsmt SF`)
par(mfrow = c(2, 2))
plot(Table_regression2)
summary(Table_regression2)
##
## Call:
## lm(formula = SalePrice ~ `Garage Area` + `Gr Liv Area` + `Total Bsmt SF`)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117102 -17408 770 17798 93213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39297.888 2520.187 -15.59 <0.0000000000000002 ***
## `Garage Area` 108.632 4.521 24.03 <0.0000000000000002 ***
## `Gr Liv Area` 73.264 1.749 41.88 <0.0000000000000002 ***
## `Total Bsmt SF` 55.144 1.935 28.50 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29150 on 2097 degrees of freedom
## Multiple R-squared: 0.7867, Adjusted R-squared: 0.7864
## F-statistic: 2578 on 3 and 2097 DF, p-value: < 0.00000000000000022
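As a quick illustration of how the refitted model can be used, the sketch below predicts the sale price of a hypothetical property; the input values are invented for illustration, not taken from the data.

```r
# Predicted sale price for a hypothetical house: 500 sq ft garage,
# 1,500 sq ft of living area, 1,000 sq ft basement (illustrative values).
new_house <- data.frame(
  `Garage Area` = 500,
  `Gr Liv Area` = 1500,
  `Total Bsmt SF` = 1000,
  check.names = FALSE
)
predict(Table_regression2, newdata = new_house, interval = "prediction")
```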
# 12. Attempt to correct any issues that you have discovered in your model. Did your changes improve the model, why or why not?
par(mfrow = c(1, 1))
hist(data.only.numeric$SalePrice)
hist(only.numeric.no.outliers$SalePrice)
Observations
Eliminating the influential observations was necessary, and the model's performance improved noticeably, as the diagnostic plots show.
The Q-Q plot now follows the reference line closely and the points in the Scale-Location plot are evenly dispersed, so the main issues with the model were resolved by removing the outliers.
The histogram of SalePrice shows that the distribution, previously skewed to the right, is now approximately normal.
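The visual impression from the two histograms can be quantified with a moment-based skewness measure; a minimal base-R sketch (the helper function skew is ad hoc, not from a package), using the data frames named above:

```r
# Sample skewness: clearly positive for the raw SalePrice, and close to
# zero after the influential observations have been removed.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew(data.only.numeric$SalePrice)
skew(only.numeric.no.outliers$SalePrice)
```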
2.13 Use the all subsets regression method to identify the “best” model
# 13. Use the all subsets regression method to identify the "best" model.
########################################################################
regfit_full <- regsubsets(SalePrice ~ ., data = only.numeric.noNA)
## Reordering variables and trying again:
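regsubsets() stores one best candidate model per subset size; the criteria for choosing among them have to be extracted from its summary. A short sketch, assuming regfit_full from above:

```r
# Extract the selection criteria and identify the subset size with the
# highest adjusted R-squared, then show its coefficients.
reg_summary <- summary(regfit_full)
reg_summary$adjr2                    # adjusted R^2 by subset size
best_size <- which.max(reg_summary$adjr2)
coef(regfit_full, best_size)         # predictors in the chosen subset
```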
# a. Looking at the model selected by subsets method
model2 <- lm(SalePrice ~ `Overall Qual` + `BsmtFin SF 1` + `Gr Liv Area`)
summary(model2)
##
## Call:
## lm(formula = SalePrice ~ `Overall Qual` + `BsmtFin SF 1` + `Gr Liv Area`)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126812 -16880 145 17366 124651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -85588.542 2882.940 -29.69 <0.0000000000000002 ***
## `Overall Qual` 25074.378 558.461 44.90 <0.0000000000000002 ***
## `BsmtFin SF 1` 33.779 1.481 22.82 <0.0000000000000002 ***
## `Gr Liv Area` 64.928 1.720 37.75 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27160 on 2097 degrees of freedom
## Multiple R-squared: 0.8148, Adjusted R-squared: 0.8145
## F-statistic: 3074 on 3 and 2097 DF, p-value: < 0.00000000000000022
plot(model2)
Observations
The all subsets method selects Overall Qual, BsmtFin SF 1, and Gr Liv Area as predictors. All three coefficients are highly significant, and the model explains about 81.5% of the variance in SalePrice (adjusted R² = 0.8145) with a residual standard error of 27,160, an improvement over Table_regression2.
2.14 Compare the preferred model from step 13 with your model from step 12
compare_performance(Table_regression2, model2, rank = TRUE)
## # Comparison of Model Performance Indices
##
## Name | Model | R2 | R2 (adj.) | RMSE | Sigma | AIC weights | AICc weights | BIC weights | Performance-Score
## ------------------------------------------------------------------------------------------------------------------------------------
## model2 | lm | 0.815 | 0.814 | 27136.955 | 27162.824 | 1.00 | 1.00 | 1.00 | 100.00%
## Table_regression2 | lm | 0.787 | 0.786 | 29121.002 | 29148.763 | 4.12e-65 | 4.12e-65 | 4.12e-65 | 0.00%
plot(compare_performance(Table_regression2, model2, rank = TRUE))
Observations
The comparison indicates that the model selected by the all subsets method, model2, performs best: it has the higher R² (0.815 vs. 0.787), the lower RMSE, and receives essentially all of the AIC, AICc, and BIC weight.
The plot of compare_performance() confirms visually that model2, the subsets-method model, is the superior choice.
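The same ranking can be cross-checked without the performance package, using base R's information criteria (lower is better for both):

```r
# AIC and BIC both penalise model complexity; with equal numbers of
# predictors here, the comparison reduces to goodness of fit, and both
# criteria should favour model2.
AIC(Table_regression2, model2)
BIC(Table_regression2, model2)
```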
3. CONCLUSIONS
4. REFERENCES